Skip to content

Fixup merge master into Chinese Word Segmentation#20055

Merged
seanbudd merged 9 commits into
nvaccess:try-chineseWordSegmentation-stagingfrom
CrazySteve0605:fixup-mergeMaster
May 11, 2026
Merged

Fixup merge master into Chinese Word Segmentation#20055
seanbudd merged 9 commits into
nvaccess:try-chineseWordSegmentation-stagingfrom
CrazySteve0605:fixup-mergeMaster

Conversation

@CrazySteve0605
Copy link
Copy Markdown
Contributor

Link to issue number:

Summary of the issue:

Description of user facing changes:

Description of developer facing changes:

Description of development approach:

Testing strategy:

Known issues with pull request:

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

Comment thread projectDocs/issues/githubIssueTemplateExplanationAndExamples.md
Comment thread projectDocs/issues/readme.md
Copy link
Copy Markdown
Contributor

@cary-rowen cary-rowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this an unexpected change?

@cary-rowen
Copy link
Copy Markdown
Contributor

cary-rowen commented May 5, 2026

Hi @CrazySteve0605, I reviewed this PR against the Copilot comments from #19166. Most of the explicit comments are handled correctly, but I think two issues still need attention.

First, the braille offset-converter issue is still unresolved. braille.py still keeps only one converter. When Chinese word-segmentation spacing is applied first and Unicode normalization is also enabled, the segmentation converter is overwritten by the normalization converter. As a result, brailleToRawPos / rawToBraillePos are only mapped back through the last converter, so cursor routing / selection mapping can still be wrong. This probably needs converter composition, or applying the transformations while keeping a combined mapping back to the original raw text.

Second, _initCppJieba now calls cls._lib.initJieba(dictDir), but it does not check the returned boolean. Since the C++ side now returns false on initialization failure, the Python side should probably treat that as failure too, e.g. set cls._lib = None and log/debugWarn. Otherwise NVDA may think cppjieba is available even though it was not initialized successful.

The rest of the fixes look broadly in the right direction to me.

@cary-rowen
Copy link
Copy Markdown
Contributor

I will test the actual experience with Focus 80 later. The above is just a response to some points that Copilot raised that may need to be considered.

CrazySteve0605 and others added 3 commits May 5, 2026 16:48
- use a list of converters for improved processing
- add unit test for Chinese word segmentation and Unicode normalization offsets
@cary-rowen
Copy link
Copy Markdown
Contributor

cary-rowen commented May 5, 2026

Hi @CrazySteve0605

While testing Chinese braille word segmentation, I found a regression: some NVDA built-in braille state abbreviations are being split by the word segmentation logic. For example, the checked state for a checkbox should remain ⣏⣿⣹, but with this feature enabled it becomes ⣏ ⣿ ⣹, with unexpected spaces inserted between the braille cells.

This seems to happen because WordSegWithSeparatorOffsetConverter is applied to the entire rawText in braille.Region.update(). That text contains not only user-facing content, but also NVDA-generated braille state abbreviations. The current segmentation logic avoids inserting spaces next to punctuation, but Braille Pattern characters are Unicode symbols, so they are not protected by that rule.

I think this should be fixed in this PR. A reasonable approach would be to avoid inserting word-segmentation separators between Braille Pattern characters, or more generally avoid inserting separators across symbol boundaries. It would also be good to add regression tests to ensure ⣏⣿⣹ remains unchanged, while normal Chinese text is still segmented as expected.

Btw, please remember to remove irrelevant .md file changes from the changes.

Thanks

@cary-rowen
Copy link
Copy Markdown
Contributor

Regarding Braille status abbreviations, please refer to this section in the user guide.

@cary-rowen
Copy link
Copy Markdown
Contributor

Thanks @CrazySteve0605

The fix works for me.
It's ready for review?

@CrazySteve0605 CrazySteve0605 marked this pull request as ready for review May 9, 2026 02:42
@CrazySteve0605 CrazySteve0605 requested review from a team as code owners May 9, 2026 02:42
@CrazySteve0605 CrazySteve0605 requested review from Qchristensen and seanbudd and removed request for a team May 9, 2026 02:42
@CrazySteve0605
Copy link
Copy Markdown
Contributor Author

Hi @cary-rowen , thanks for reviews and testing.

@seanbudd seanbudd merged commit fab3060 into nvaccess:try-chineseWordSegmentation-staging May 11, 2026
34 of 37 checks passed
@github-actions github-actions Bot added this to the 2026.2 milestone May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants